Compositions and Methods for Labeling Target Nucleic Acid Molecules

ABSTRACT

Provided herein are methods and compositions for labeling target nucleic acid molecules with molecular barcodes.

CROSS REFERENCE

This application is a divisional of U.S. application Ser. No. 15/051,064, filed Feb. 23, 2016, which claims the benefit of U.S. Provisional Application Nos. 62/119,643, filed Feb. 23, 2015, and 62/142,104, filed Apr. 2, 2015, which applications are incorporated by reference herein.

BACKGROUND

Next generation sequencing (NGS) has the potential to be an invaluable tool for the diagnosis and treatment of many different diseases and disorders. To reduce the cost of sequencing and the burden of data analysis, a target enrichment process can be used to increase the relative abundance of target sequences in a nucleic acid library prior to performance of the NGS reaction (e.g., as described in US 2014/0287468, which is hereby incorporated by reference in its entirety).

Compatibility between target enrichment methods and multiplex sequencing processes is critical because the reduced complexity of enriched target libraries requires multiplex sequencing to be cost effective. Nearly all multiplex sequencing approaches include the labeling of individual libraries with a library-specific nucleic acid barcode, referred to as a “multiplex identifier nucleotide sequence,” or “MID,” through the addition of a MID to a platform-specific adapter sequence or a PCR primer. Because the sequence of the MID corresponds to the originating library, multiple libraries incorporating distinct MIDs can be combined and sequenced in a single sequencing reaction and, following sequencing, the MIDs can be used in silico to associate each resulting sequence with the library from which it originated.
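The in silico use of MIDs described above can be illustrated with a minimal Python sketch. It is not part of the disclosed methods: the barcode table, the fixed MID length, and the assumption that the MID sits at the start of each read are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical MID-to-library table; these barcodes are illustrative only.
MID_TO_LIBRARY = {
    "ACGTAC": "library_A",
    "TGCATG": "library_B",
}
MID_LENGTH = 6  # assumed fixed MID length


def demultiplex(reads, mid_to_library=MID_TO_LIBRARY, mid_length=MID_LENGTH):
    """Group reads by the library named by their leading MID.

    Reads whose MID is not in the table are collected under "unassigned".
    The MID is trimmed from each read before the remainder is stored.
    """
    bins = defaultdict(list)
    for read in reads:
        mid, remainder = read[:mid_length], read[mid_length:]
        bins[mid_to_library.get(mid, "unassigned")].append(remainder)
    return dict(bins)


pooled_reads = ["ACGTACGGATTC", "TGCATGCCAGTA", "ACGTACTTTGCA", "NNNNNNAAAA"]
by_library = demultiplex(pooled_reads)
```

Exact matching is used here for simplicity; a practical pipeline might instead tolerate a limited number of mismatches in the MID.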

In addition to MIDs, some target enrichment protocols also label individual DNA molecules with molecule-specific nucleic acid barcodes, referred to as “unique identifier sequences,” or “UIDs,” such as a degenerate base region (DBR), prior to amplification. The presence of such sequences makes it possible to distinguish unique DNA molecules from PCR duplicates, enabling the more accurate identification and quantification of unique DNA molecules and mutations.
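As a minimal sketch of how UIDs allow PCR duplicates to be distinguished from unique molecules, the following Python example groups reads by their UID and insert sequence. The read representation (a tuple of UID and insert) and the function name are assumptions made for illustration only.

```python
from collections import Counter


def count_read_families(reads):
    """Count reads per (UID, insert) pair.

    Each read is a (uid, insert_sequence) tuple. Reads sharing both the UID
    and the insert sequence are treated as PCR duplicates of one molecule.
    """
    return Counter(reads)


reads = [
    ("AACCGGTT", "GATTACA"),  # molecule 1
    ("AACCGGTT", "GATTACA"),  # PCR duplicate of molecule 1
    ("TTGGCCAA", "GATTACA"),  # same insert, different UID: a second molecule
    ("AACCGGTT", "CCCTTTA"),  # same UID, different insert: a third molecule
]
families = count_read_families(reads)
unique_molecule_count = len(families)
```

Without the UID, the first three reads would be indistinguishable and would be counted as a single molecule with two duplicates.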

As the diagnostic power of genome and transcriptome analysis increases, improved methods and compositions for the labeling of nucleic acid libraries with library-specific sequences and/or molecule-specific sequences are desirable to facilitate the use of target enrichment methods with multiplex NGS processes.

SUMMARY

Provided herein are methods and compositions for labeling target nucleic acid molecules with molecular barcodes (e.g., with sample-specific barcodes and/or with molecule-specific barcodes). Certain embodiments of the methods and compositions provided herein can be useful, for example, to facilitate the performance of multiplex NGS processes on target-enriched polynucleotide libraries and/or for enabling accurate identification and quantification of unique polynucleotide molecules in a polynucleotide library.

In certain aspects, provided herein is a method of barcode labeling a library of polynucleotide molecules. The barcodes can be sample-specific barcodes (i.e., multiplex identifier sequences, or MIDs) and/or molecule-specific barcodes (i.e., UIDs).

In certain aspects, provided herein is a method of barcode labeling a library of polynucleotide molecules that include, for example, a 3′ adapter, a genomic DNA, cDNA or RNA sequence and a 5′ adapter. In certain embodiments, the method includes a step of contacting the library of polynucleotide molecules with primers (e.g., sample index primers) comprising a first hybridization sequence (i.e., a sequence capable of hybridizing to at least a portion of the 3′ adapter) and, optionally, also comprising a sample-specific barcode sequence located 5′ of the first hybridization sequence, wherein the primers are contacted with the polynucleotide library under conditions such that the first hybridization sequence hybridizes to a region of the 3′ adapter of the library of polynucleotide molecules. In some embodiments, the method comprises incubating the hybridized polynucleotide molecules with a polymerase such that the polymerase extends the 3′ end of the primers (e.g., sample index primers) to form primer extension products comprising, in 5′ to 3′ order, a known sequence, (optionally) a sample-specific barcode sequence, a sequence complementary to at least a portion of the 3′ adapter, a sequence complementary to the genomic DNA, cDNA or RNA derived sequence, and a sequence complementary to the 5′ adapter. In some embodiments, the method includes contacting the primer extension products with UID-template oligonucleotides comprising, in 3′ to 5′ order, a non-extendable 3′ end, a second hybridization sequence (i.e., a sequence capable of hybridizing to at least a portion of the sequence complementary to the 5′ adapter), and a variable barcode sequence under conditions such that the second hybridization sequence hybridizes to a 3′ terminal region of the sequence complementary to the 5′ adapter. In some embodiments, the method includes incubating the hybridized primer extension products with a polymerase such that the polymerase extends the 3′ end of the primer extension products to form a further extended primer extension product comprising, in 5′ to 3′ order, a known sequence, (optionally) the sample-specific barcode sequence, the sequence complementary to at least a portion of the first adapter, the sequence complementary to the genomic DNA, cDNA or RNA derived sequence, the sequence complementary to the second adapter, and a sequence complementary to the variable barcode sequence of the oligonucleotide. In some embodiments, the method includes amplifying the further extended primer extension products (e.g., using PCR or another amplification method known in the art). In some embodiments, the method includes the step of denaturing the primer extension products from the polynucleotide molecules before contacting the primer extension products with the UID-template oligonucleotides. In some embodiments, the 5′ terminus of the sample index primer is protected from exonuclease digestion and the method further comprises the step of incubating the primer extension product/polynucleotide library molecule complex with a 5′ exonuclease such that a 5′ terminal sequence is removed from the polynucleotide molecules of the library before contacting the primer extension products with the UID-template oligonucleotides. In some embodiments, the method further comprises the step of performing a sequencing process (e.g., a NGS process) on the amplification products.
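The order of sequence elements produced by the two extension steps can be traced with a short Python sketch that treats each strand as a string written 5′ to 3′. This is an illustrative model only: all sequences, lengths, and function names are hypothetical, the sample index primer is assumed to hybridize to the whole 3′ adapter, and the non-extendable 3′ end of the UID-template oligonucleotide is not represented because the model never extends that oligonucleotide.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def first_extension(adapter5, insert, adapter3, known, sample_barcode):
    """Model extension of a sample index primer along a library strand.

    The library strand is adapter5 + insert + adapter3 (5' to 3'). The primer
    is known + sample_barcode + a sequence complementary to the 3' adapter.
    """
    primer = known + sample_barcode + reverse_complement(adapter3)
    return primer + reverse_complement(insert) + reverse_complement(adapter5)


def extend_on_uid_template(product, uid_template, hyb_length):
    """Model further extension of the product along a UID-template oligo.

    The last hyb_length bases of uid_template (its 3' portion) must pair with
    the 3' end of the product; the unpaired 5' portion is then copied.
    """
    hyb = uid_template[-hyb_length:]
    if not product.endswith(reverse_complement(hyb)):
        raise ValueError("UID template does not hybridize to the product 3' end")
    overhang = uid_template[:-hyb_length]
    return product + reverse_complement(overhang)


# Hypothetical toy sequences.
adapter5, insert, adapter3 = "AAGGCC", "GATTACA", "TTCCGG"
known, sample_barcode = "CATG", "ACAC"
known2, uid = "GTGT", "AACCGTTA"

product = first_extension(adapter5, insert, adapter3, known, sample_barcode)
uid_template = known2 + uid + adapter5  # 5' to 3': known, UID, hybridization
further = extend_on_uid_template(product, uid_template, len(adapter5))
```

In this model the further extended product ends with the complement of the UID followed by the complement of the additional known sequence, so each product molecule carries a copy of whichever UID template it happened to hybridize with.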

In certain aspects, provided herein is a method of barcode labeling primer extension products formed from a library of polynucleotide molecules. In some embodiments, the method includes contacting primer extension products with UID-template oligonucleotides, wherein: (i) the primer extension products are formed from a library of polynucleotide molecules; (ii) the primer extension products comprise, in 5′ to 3′ order, a sequence complementary to at least a portion of a first adapter, a sequence complementary to a genomic DNA, cDNA or RNA derived sequence, and a sequence complementary to a second adapter; (iii) the oligonucleotides comprise, in 3′ to 5′ order, a non-extendable 3′ end, a hybridization sequence (i.e., a sequence capable of hybridizing to at least a portion of the sequence complementary to the 5′ adapter), and a variable barcode sequence; and (iv) the primer extension products are contacted with the UID-template oligonucleotides such that the hybridization sequence hybridizes to a 3′ terminal region of the sequence complementary to the second adapter. In some embodiments, the method includes the step of incubating the hybridized primer extension products with a polymerase such that the polymerase extends the 3′ end of the primer extension products to form a further extended primer extension product comprising, in 5′ to 3′ order, the sequence complementary to at least a portion of the first adapter, the sequence complementary to the genomic DNA, cDNA or RNA derived sequence, the sequence complementary to the second adapter, a sequence complementary to the variable barcode sequence of the UID-template oligonucleotides, and additional known sequence. In certain embodiments, the method includes amplifying the further extended primer extension products (e.g., using PCR). In some embodiments, the method further comprises the steps of contacting the library of polynucleotide molecules with primers comprising a 3′ adapter hybridization sequence such that the 3′ adapter hybridization sequence hybridizes to a region of the 3′ adapter; and incubating the resulting hybridized DNA molecules with a polymerase such that the polymerase extends the 3′ end of the primers to form the primer extension products. In some embodiments, the method includes the step of denaturing the primer extension products from the DNA molecules before contacting the primer extension products with the UID-template oligonucleotides. In some embodiments, the 5′ terminus of the sample index primer is protected from exonuclease digestion and the method further comprises the step of incubating the primer extension product/polynucleotide library molecule complex with a 5′ exonuclease such that a 5′ terminal sequence is removed from the DNA molecules of the library before contacting the primer extension products with the UID-template oligonucleotides. In some embodiments, the method further comprises the step of performing a sequencing process (e.g., a NGS process) on the amplification products.

In certain aspects, provided herein is a method of barcode labeling a library of polynucleotide molecules comprising, in 3′ to 5′ order, a first adapter, a sequence derived from genomic DNA, cDNA or RNA, and a second adapter comprising an unreplicable region using a terminal transferase. In certain embodiments, the method includes contacting the library of polynucleotide molecules with primers comprising a first adapter hybridization sequence such that the first adapter hybridization sequence hybridizes to a region of the first adapter. In some embodiments, the method includes incubating the hybridized polynucleotide molecules with a polymerase such that the polymerase extends the 3′ end of the primers to form primer extension products comprising, in 5′ to 3′ order, a sequence complementary to at least a portion of the first adapter, a sequence complementary to the genomic DNA, cDNA, or RNA derived sequence, and a sequence complementary to the portion of the second adapter that is 3′ of the unreplicable region. In some embodiments, the method includes incubating the products of the primer extension reaction with a 3′ terminal transferase such that a random nucleic acid sequence is added to the 3′ terminus of the primer extension products. In some embodiments, the method includes incubating the products of the terminal transferase reaction with a polymerase such that the polymerase further extends the primer extension products to form a further extended primer extension product comprising, in 5′ to 3′ order, the sequence complementary to at least a portion of the first adapter, the sequence complementary to the genomic DNA, cDNA or RNA derived sequence, the sequence complementary to the portion of the second adapter that is 3′ of the unreplicable region, the random nucleic acid sequence, and a sequence complementary to a portion of the second adapter that is 5′ of the unreplicable region. In some embodiments, the method further comprises the step of amplifying the further extended primer extension product. In some embodiments, the method further comprises the step of performing a sequencing process (e.g., a NGS process) on the product of the amplification reaction.

In certain aspects, provided herein is a method of barcode labeling a library of polynucleotide molecules comprising a 3′ adapter, a sequence derived from genomic DNA, cDNA or RNA, and, optionally, a 5′ adapter using a terminal transferase. In some embodiments, the method includes contacting the library of polynucleotide molecules with primers comprising a 3′ adapter hybridization sequence such that the 3′ adapter hybridization sequence hybridizes to a region of the 3′ adapter. In some embodiments, the method includes incubating the hybridized DNA molecules with a polymerase such that the polymerase extends the 3′ end of the sample index primers to form primer extension products. In some embodiments, the method includes incubating the primer extension products with a 3′ terminal transferase such that a random nucleic acid sequence is added to the 3′ terminus of the primer extension products. In some embodiments, a single stranded (ss) adapter is ligated to the 3′ end of the random nucleic acid sequence. In some embodiments, a double stranded (ds) adapter is ligated to the 3′ end of the random nucleic acid sequence, wherein the adapter comprises a sequence complementary to the nucleic acid sequence on the primer extension product. In some embodiments, the method further comprises amplifying the product of the ligation reaction. In some embodiments, the method further comprises the step of performing a sequencing process (e.g., a NGS process) on the product of the amplification reaction.

In any of the above embodiments, the method may further comprise removing a 5′ terminal sequence from the polynucleotide molecules of the library after the initial extension step and prior to the next extension step, to leave a 3′ overhang. In these embodiments, the 5′ terminus of the sample index primer may be protected from exonuclease digestion and the 5′ terminal sequence of the polynucleotide molecules of the library may be removed by an exonuclease. Alternatively, the second adapter may comprise one or more deoxyuridines, and the 5′ terminal sequence of the polynucleotide molecules of the library may be removed using uracil DNA glycosylase (UDG) and the DNA endonuclease VIII, e.g., using USER™ (New England Biolabs, Ipswich, Mass.).

In certain aspects, provided herein is a composition useful in the performance of the methods described herein. In some embodiments, the composition comprises: (a) a mixture of primer extension products formed from a library of polynucleotide molecules comprising, in 3′ to 5′ order, a first adapter, a sequence derived from genomic DNA, cDNA or RNA, and a second adapter, wherein the primer extension products comprise, in 5′ to 3′ order, a sequence complementary to the first adapter, a sequence complementary to the genomic DNA, cDNA or RNA derived sequence, and a sequence complementary to the second adapter; and (b) an excess amount of oligonucleotides, wherein the oligonucleotides comprise, in 3′ to 5′ order, a non-extendable 3′ end, a sequence capable of hybridizing to the 3′ terminal region of the primer extension products and a variable barcode sequence. In some embodiments, the molar ratio of primer extension product to synthetic oligonucleotide is between 1:1 and 1:1×10¹². For example, in some embodiments, the molar ratio of primer extension product to synthetic oligonucleotide is at least 1:1, 1:10, 1:50, 1:100, 1:500, 1:1000, 1:1×10⁴, 1:1×10⁵, 1:1×10⁶, 1:1×10⁷, 1:1×10⁸, 1:1×10⁹ or 1:1×10¹⁰. In some embodiments, the composition comprises a mixture of primer extension products that are complementary to a ss polynucleotide library comprising an adapter on at least one end of a sequence derived from genomic DNA, cDNA or RNA; and a terminal transferase. In some embodiments, the composition further comprises a polymerase. In some embodiments, the composition further comprises dNTPs.

In certain aspects, provided herein is a method for identifying the number of unique molecules in a nucleic acid sample, the method comprising performing the method described herein to label a library of polynucleotide molecules with unique variable sequences; amplifying the library of polynucleotide molecules; and sequencing the unique variable sequences.

In certain aspects, provided herein is a kit for performing a method described herein. In some embodiments, the kit comprises a nucleic acid molecule encoding a 5′ adapter sequence and oligonucleotides comprising, in 3′ to 5′ order, a non-extendable 3′ end, a sequence capable of hybridizing to the 3′ terminal region of a sequence complementary to the 5′ adapter sequence and a variable barcode sequence. In some embodiments, the kit further comprises a nucleic acid molecule encoding a 3′ adapter sequence and a sample index primer comprising, in 5′ to 3′ order, a sample-specific barcode sequence and a sequence capable of hybridizing to at least a portion of the 3′ adapter sequence. In some embodiments, the kit further comprises a first primer capable of hybridizing to a sequence complementary to a sequence of the sample index primer located 5′ of the sample-specific barcode sequence and a second primer capable of hybridizing to a sequence complementary to a sequence of the oligonucleotides located 5′ of the variable barcode sequence.

In any of the above methods, compositions or kits, the 5′ end of the synthetic oligonucleotide may be protected from exonuclease digestion. In some embodiments, the second adapter comprises one or more deoxyuridines.

In certain embodiments, a hybridization sequence described herein is capable of hybridizing “under stringent conditions” to another sequence. Stringent hybridization conditions include, for example, hybridization in 6× sodium chloride/sodium citrate (SSC) at about 45° C., followed by one or more washes in 0.2× SSC, 0.1% SDS at 50° C. to 65° C.

BRIEF DESCRIPTION OF THE FIGURES

FIGS. 1A-1F show an exemplary method for adding degenerate sequences that serve as unique identifiers to polynucleotides according to certain embodiments described herein.

FIG. 1A shows a library of polynucleotide molecules containing 3′ (1) and 5′ (2) adapter sequences that is melted to form a ss polynucleotide library. (3) represents the captured polynucleotide fragment.

FIG. 1B shows a primer (optionally including a sample index) (4) that hybridizes to the 3′ adapter sequence of the library ss polynucleotide. (5) represents the region of the ds polynucleotide that is hybridized to the 3′ end of the sample primer (4).

FIG. 1C shows a duplex formed by polymerase dependent primer extension. The library polynucleotide serves as a template for the extension, resulting in the formation of a duplex. The newly formed duplex now includes a primer extension product containing the primer sequence, a sequence complementary to at least a portion of the 3′ adapter, a sequence complementary to the target polynucleotide and a sequence complementary to the 5′ adapter sequence. (6) refers to the polymerase synthesized polynucleotide complement to the library polynucleotide sequence and the 5′ adapter sequence.

FIG. 1D shows the denatured duplex, which is combined with an excess of an oligonucleotide (UID template) (10). The oligonucleotide hybridizes to the 3′ end of the primer extension sequence (i.e., to a sequence complementary to the 5′ adapter sequence) (7). The oligonucleotide contains a variable barcode sequence (a UID) (8) located 5′ to the hybridization sequence. (9) refers to the non-extendable 3′ end.

FIG. 1E shows the extension of the primer extension product to incorporate the variable barcode sequence. This product may be heat denatured and additional rounds of second strand extension performed. (11) represents the newly synthesized polynucleotide (polymerase dependent). (12) represents the non-extendable 3′ end.

FIG. 1F shows amplification, using, for example, PCR, of the ss primer extension product from FIG. 1E, i.e., the product of extension that incorporates a sequence complementary to the non-hybridizing UID template sequence. (13) represents the primer pair used for PCR.

FIGS. 2A-2G show an exemplary method for adding degenerate sequences that serve as unique identifiers to DNA according to certain embodiments described herein.

FIGS. 2A, 2B and 2C are the same as in FIGS. 1A, 1B and 1C.

FIG. 2D shows the product of 5′ exonuclease digestion from the 5′ end of the library polynucleotide strand in the duplex, while the primer extension product remains undigested due to the exonuclease protected 5′ end. (14) represents the exonuclease protected end. (15) represents the 5′ exonuclease digested region.

FIG. 2E shows the addition of an excess of an oligonucleotide that hybridizes to the 3′ end of the primer extension sequence (i.e., to a sequence complementary to the 5′ adapter sequence). The oligonucleotide contains a variable barcode sequence (a UID) located 5′ to the hybridization sequence.

FIG. 2F shows how the primer extension product is further extended to incorporate the variable barcode sequence. This product may be heat denatured and additional rounds of second strand extension performed.

FIG. 2G shows amplification of the primer extension product using, for example, PCR.

FIGS. 3A-3G provide exemplary nucleic acid sequences from certain embodiments of the methods and compositions described herein.

FIG. 3A shows an exemplary member of a DNA library that includes an exemplary 3′ adapter sequence (the Read 2 sequencing primer sequence, SEQ ID NO:1) and an exemplary 5′ adapter sequence (the Read 1 sequencing primer sequence, SEQ ID NO:2).

FIG. 3B shows an exemplary Sample Index Primer sequence (SEQ ID NO:3) aligned with the exemplary member of a DNA library. A sample identifier sequence is shown (index). P7 identifies the P7 primer recognition sequence.

FIG. 3C shows an exemplary primer extension product sequence (SEQ ID NO:3 and SEQ ID NO:4) aligned with the exemplary member of a DNA library.

FIG. 3D shows the exemplary primer extension product sequence (SEQ ID NO:3, SEQ ID NO:4) aligned with the sequence of the exemplary member of a DNA library following 5′ exonuclease digestion (SEQ ID NO:1).

FIG. 3E shows the 3′ terminal region of the exemplary primer extension product sequence (SEQ ID NO:4) aligned with an exemplary oligonucleotide containing a degenerate barcode sequence (NNNNNNNN) (SEQ ID NO:5). P5 represents the P5 primer recognition sequence.
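To give a sense of scale for an eight-position degenerate barcode such as the NNNNNNNN sequence of FIG. 3E, the following Python sketch computes the number of possible barcodes and, under the simplifying assumption (not stated in this document) that each molecule receives a barcode drawn uniformly at random, the expected number of distinct barcodes among a given number of labeled molecules. The function names are hypothetical.

```python
def barcode_space(length):
    """Number of possible barcodes with four bases at each degenerate position."""
    return 4 ** length


def expected_distinct_barcodes(n_molecules, length):
    """Expected count of distinct barcodes when n_molecules each draw one
    barcode independently and uniformly at random (an idealized assumption).
    """
    space = barcode_space(length)
    return space * (1.0 - (1.0 - 1.0 / space) ** n_molecules)


total = barcode_space(8)  # 4**8 possible 8-nt barcodes
distinct_among_1000 = expected_distinct_barcodes(1000, 8)
```

The expected number of distinct barcodes is always slightly below the number of molecules, and the shortfall reflects pairs of molecules that receive the same barcode by chance. Molecules that share a barcode can often still be told apart by their insert sequence.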

FIG. 3F shows the 3′ terminal region of an exemplary primer extension product sequence (SEQ ID NO:6) following performance of a further extension step, aligned with the exemplary oligonucleotide containing a degenerate barcode sequence (NNNNNNNN) (SEQ ID NO:5).

FIG. 3G shows the exemplary further extended primer extension product sequence (SEQ ID NO:6, SEQ ID NO:3) aligned with a first PCR primer (PCR1, SEQ ID NO:8) and a second PCR primer (PCR2, SEQ ID NO:7).

FIGS. 4A-4F show an exemplary method for adding degenerate sequences that serve as unique identifiers to polynucleotides according to certain embodiments described herein in which a terminal transferase is used to add random sequence to the 3′ end of an extension product generated from a polynucleotide library.

FIG. 4A shows a sample from a library of polynucleotide molecules with a 3′ adapter and, optionally, a 5′ adapter.

FIG. 4B shows hybridization of a primer to the 3′ adapter sequence. The primer may optionally include a sample-specific barcode sequence.

FIG. 4C shows extension of the primer to produce a primer extension product which is hybridized to the sample polynucleotide such that the 3′ end of the primer extension product forms a blunt end with the 5′ end of the sample polynucleotide.

FIG. 4D shows that the 3′ end of the primer extension product is further extended by a terminal transferase, which adds an untemplated random sequence to the 3′ end of the primer extension product. This untemplated random sequence can serve as a molecule-specific barcode (a UID).

FIG. 4E shows how a ss DNA adapter is ligated to the 3′ end of the primer extension sequence.

FIG. 4F provides an alternative to FIG. 4E. A ds DNA adapter is ligated to the untemplated random sequence, with the ds adapter including a random sequence on one strand that can hybridize to the random sequence added to the primer extension product by the terminal transferase.

FIGS. 5A-5E show an exemplary method for adding degenerate sequences that serve as unique identifiers to DNA according to certain embodiments described herein in which a terminal transferase is used to add random sequence to the 3′ end of an extension product generated from a DNA library.

FIG. 5A shows a library of polynucleotide molecules with a 3′ adapter and a 5′ adapter. The 5′ adapter is a synthetic oligonucleotide that contains modified nucleotides that can terminate replication by a DNA polymerase (an unreplicable sequence).

FIG. 5B shows how a primer hybridizes to the 3′ adapter sequence. The primer may optionally include a sample-specific barcode sequence.

FIG. 5C shows that the primer can be extended up to the unreplicable sequence to produce a primer extension product.

FIG. 5D shows extension of the 3′ end of the primer extension product by a terminal transferase, which adds an untemplated random sequence to the 3′ end of the primer extension product. This untemplated random sequence can serve as a molecule-specific barcode (a UID). The sequence added by the terminal transferase bridges the unreplicable region.

FIG. 5E shows that the polymerase further extends the primer extension product to incorporate a sequence complementary to the remainder of the 5′ adapter.

DESCRIPTION OF TERMS

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. For example, the term “a primer” refers to one or more primers, i.e., a single primer and multiple primers. It is further noted that the claims can be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.

The term “sample” as used herein relates to a material or mixture of materials, typically, although not necessarily, in liquid form, containing one or more analytes of interest. In one embodiment, the term, as used in its broadest sense, refers to any plant, animal or viral material containing DNA or RNA, such as, for example, tissue or fluid isolated from an individual (including without limitation plasma, serum, cerebrospinal fluid, lymph, tears, saliva and tissue sections), from preserved tissue (such as FFPE sections) or from in vitro cell culture constituents, as well as samples from the environment. DNA molecules of interest include genomic DNA (which could be from the nucleus or organelle of a cell, or from the genome of a virus), and RNA molecules of interest include messenger RNAs (mRNAs), microRNAs (miRNAs), long non-coding RNAs (lncRNAs), ribosomal RNAs (rRNAs), transfer RNAs (tRNAs) and the genome of an RNA virus, etc., which RNAs may be copied to cDNA, if required.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, up to about 10,000 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically.

The term “nucleic acid sample,” as used herein, denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA and cDNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than 10⁴, 10⁵, 10⁶ or 10⁷ different nucleic acid molecules. Also, a complex sample may comprise only a few molecules, where the molecules collectively have more than 10⁴, 10⁵, 10⁶ or 10⁷ or more nucleotides. A DNA target may originate from any source such as genomic DNA, cDNA or an artificial DNA construct. Any sample containing nucleic acid, e.g., genomic DNA, cDNA or RNA made from tissue culture cells or a sample of tissue, may be employed herein.

The term “library of polynucleotide molecules” as used herein refers to a library that may be complex in that it contains multiple different molecules. Genomic DNA and cDNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than 10⁴, 10⁵, 10⁶ or 10⁷ different nucleic acid molecules. Also, a complex sample may comprise only a few molecules, where the molecules collectively have more than 10⁴, 10⁵, 10⁶ or 10⁷ or more nucleotides. A polynucleotide (or nucleic acid) may originate from any source such as genomic DNA, cDNA or an artificial DNA construct or from mRNA, microRNAs, long non-coding RNAs or other RNAs of interest. In some embodiments, a library of polynucleotide molecules may be an enriched library, in which case the library may have a complexity of less than 10%, less than 5%, less than 1%, less than 0.5%, less than 0.1%, less than 0.01%, less than 0.001% or less than 0.0001% relative to the unenriched sample (e.g., a sample made from total RNA or total genomic DNA from a eukaryotic cell sample). Molecules can be enriched by methods such as described in US 2014/0287468 or US 2015/0119261.

The term “nucleotide” is intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, are functionalized as ethers, amines, or the like.

The term “oligonucleotide” as used herein denotes a ss multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 11 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.

The term “mixture”, as used herein, refers to a combination of elements that are interspersed and not in any particular order. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution. A mixture is not addressable. To illustrate by example, an array of spatially separated surface-bound polynucleotides, as is commonly known in the art, is not a mixture of surface-bound polynucleotides because the species of surface-bound polynucleotides are spatially distinct, and the array is addressable.

The term “primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of between 18-40, 20-35, 21-30 nucleotides long, and any length between the stated ranges. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length. As noted below, a primer may be “tailed” in the sense that the 3′ end of the primer may hybridize to a target sequence and the 5′ end of the primer may not hybridize to the target sequence. The term “tailed”, in the context of a tailed primer or a primer that has a 5′ tail, refers to a primer that has a region (e.g., a region of at least 12-50 nucleotides) at its 5′ end that does not hybridize to the same target as the 3′ end of the primer. Primers are usually ss for maximum efficiency in amplification, but may alternatively be ds, e.g., in the shape of a hairpin. Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.

The term “duplex,” or “duplexed,” as used herein, describes two complementary polynucleotides that are base-paired, i.e., hybridized together.

The term “amplifying” as used herein refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid. Amplifying a nucleic acid molecule may include denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed one or more times. In certain cases, the denaturing, annealing and elongating steps are performed multiple times such that the amount of amplification product increases, often exponentially, although exponential amplification is not required by the present methods. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme. The term “amplification product” refers to the nucleic acid sequences that are produced from the amplifying process as defined herein.

A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10⁶, at least 10⁷, at least 10⁸ or at least 10⁹ or more members.

If two nucleic acids are “complementary”, they hybridize with one another under high stringency conditions. The term “perfectly complementary” is used to describe a duplex in which each base of one of the nucleic acids base pairs with a complementary nucleotide in the other nucleic acid. In many cases, two sequences that are complementary have at least 10, e.g., at least 12 or 15 nucleotides of complementarity.

An “oligonucleotide binding site” or “hybridization sequence” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site or hybridization sequence for a primer, then the primer may hybridize to that oligonucleotide or its complement.

The term “sequencing”, as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.

The term “next-generation sequencing” refers to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, Pacific Bio, Roche, etc.

Next-generation sequencing methods may also include nanopore sequencing methods or electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies.

The term “strand” as used herein refers to a nucleic acid made up of nucleotides linked together by covalent bonds, e.g., phosphodiester bonds. In a cell, DNA usually exists in a ds form, and as such, has two complementary strands of nucleic acid referred to herein as the “top” and “bottom” strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “Watson” and “Crick” strands or the “sense” and “antisense” strands. The assignment of a strand as being a top or bottom strand is arbitrary and does not imply any particular orientation, function or structure.

The term “top strand,” as used herein, refers to either strand of a nucleic acid but not both strands of a nucleic acid. When an oligonucleotide or a primer binds or anneals “only to a top strand,” it binds to only one strand but not the other. The term “bottom strand,” as used herein, refers to the strand that is complementary to the “top strand.” When an oligonucleotide binds or anneals “only to one strand,” it binds to only one strand, e.g., the first or second strand, but not the other strand.

The term “hybridizing” or “hybridizes” refers to a process in which a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions. The formation of a duplex is accomplished by annealing two complementary nucleic acid strands in a hybridization reaction. The hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions (often referred to as hybridization stringency) under which the hybridization reaction takes place, such that hybridization between two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary.

The term “extending”, as used herein, refers to the extension of a nucleic acid, e.g., a primer or a primer extension product, by the addition of nucleotides using a polymerase. For example, if a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for the extension reaction.

The term “not extendible”, in the context of an oligonucleotide that is not extendible at its 3′ end when it is annealed to a target nucleic acid, refers to an oligonucleotide that cannot be extended by a template-dependent polymerase, either because the 3′ end of the oligonucleotide is blocked (e.g., by a dideoxy nucleotide or any of a multitude of nucleotides that are not substrates for the polymerase) or because the 3′ end of the oligonucleotide is mismatched with the target (i.e., because one or more nucleotides at the 3′ end of the oligonucleotide are not complementary to correspondingly positioned nucleotides in the target sequence).

The term “adapter” refers to a sequence that is joined to or can be joined to another molecule (e.g., ligated or copied onto via primer extension). An adapter can be DNA or RNA, or a mixture of the two. An adapter may be 15 to 100 bases, e.g., 50 to 70 bases, although adapters outside of this range are envisioned. In a library of polynucleotide molecules that contain an adapter (e.g., a 3′ or 5′ adapter), the adapter sequence used is not present in the DNA sequences under examination (i.e., the sequence in between the adapters). For example, if the library of polynucleotide molecules contains sequences derived from mammalian genomic DNA or cDNA, then the sequences of the adapters are not present in the mammalian genome under study. In many cases, the 5′ and 3′ adapters are of a different sequence and are not complementary. In many cases, an adapter will not contain a contiguous sequence of at least 8, 10 or 12 nucleotides that is found in the DNA under examination.

The term “adapter-containing”, in the context of an adapter-containing nucleic acid, refers to either a nucleic acid that has been ligated to an adapter, or to a nucleic acid to which an adapter has been added by primer extension. In some embodiments, the adapters of a library of nucleic acid molecules may be made by ligating oligonucleotides to the 5′ and 3′ ends of the molecules (or specific sequences of the same) in an initial nucleic acid sample, e.g., genomic DNA or cDNA.

The term “formed from”, in the context of a primer extension product that is “formed from” genomic DNA, cDNA or RNA, refers to primer extension products that are copied from template genomic DNA, cDNA or RNA, or the complement thereof. Such primer extension products are usually DNA.

The term “sample index primer” refers to a primer that contains a sample identifier sequence, i.e., a sequence that can be used to identify and/or track the source of a polynucleotide in a reaction. In some embodiments, several different samples may be pooled together before sequencing and each is tagged with a different sample identifier sequence. After sequencing, the sample identifier sequence in the sample index primer can be identified in the sequence reads, and the source of the sequence read (e.g., which sample) can be determined.

In use, each sample may be tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences. In many cases, each sample may be tagged with a single sample identifier sequence, i.e., a sequence that is unique to the sample.
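By way of illustration only, the in silico assignment of pooled sequence reads to their source samples might be sketched as follows. The identifier sequences, the read strings, the function name `demultiplex` and the assumption that the sample identifier occupies the first eight bases of each read are all hypothetical and are not taken from this disclosure; on many platforms the identifier is instead reported in a separate index read.

```python
from collections import defaultdict

# Hypothetical sample identifier sequences (illustrative only).
SAMPLE_IDS = {"ACGTACGT": "sample_A", "TGCATGCA": "sample_B"}


def demultiplex(reads, sample_ids, id_length=8):
    """Group reads by the sample identifier at the start of each read.

    Assumes (for illustration) that the identifier is the first
    `id_length` bases of the read. Reads whose identifier is not
    recognized are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:id_length], read[id_length:]
        by_sample[sample_ids.get(tag, "unassigned")].append(insert)
    return dict(by_sample)


# Hypothetical pooled reads: identifier followed by the sequence of interest.
pooled = ["ACGTACGTGGGTTT", "TGCATGCACCCAAA", "ACGTACGTTTTGGG", "NNNNNNNNAAAA"]
assigned = demultiplex(pooled, SAMPLE_IDS)
```

An exact-match lookup is used here for simplicity; tolerating mismatches in the identifier would be a separate design choice.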

The term “variable barcode sequence”, as used herein, refers to a molecular barcode that varies in sequence in a population of oligonucleotides. In some cases, a population of oligonucleotides may contain a high complexity variable barcode sequence, in which case, the oligonucleotide may contain a degenerate sequence made up of at least 10,000, at least 100,000 or at least 1,000,000 different sequences. In other embodiments, the population of oligonucleotides may contain a low complexity variable barcode sequence. In these embodiments, the oligonucleotide may have a region composed of less than 10,000, less than 1,000 or less than 100 sequences. In these embodiments, some primer extension products may be tagged with the same barcode sequence, but those fragments can still be distinguished by other metrics, e.g., the sequence of the fragment, the sequence of the ends of the fragment, etc. In some embodiments, at least 95%, e.g., at least 96%, at least 97%, at least 98%, at least 99% or at least 99.5% of the target polynucleotides become associated with a different barcode sequence. Such variable barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of variable barcode sequences appropriate for particular embodiments: Casbon, Nuc. Acids Res., 22 e81 (2011); Brenner, U.S. Pat. No. 5,635,400; Brenner, et al., Proc. Natl. Acad. Sci., 97: 1665-1670 (2000); Shoemaker, et al., Nature Genetics, 14: 450-456 (1996); Morris, et al., European patent publication 0799897A1; Wallace, U.S. Pat. No. 5,981,179; and the like. In particular embodiments, a variable barcode sequence may have a length in the range of from 4 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides. In some cases, an oligonucleotide with a variable barcode sequence can be made by synthesizing an oligonucleotide that contains a degenerate sequence (e.g., an oligonucleotide that has a run of 4-10 “Ns”, where “N” is G, A, T or C, or any combination thereof).
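As a minimal sketch of the arithmetic behind these complexity figures, a fully degenerate run of n positions, each of which may be G, A, T or C, can take 4ⁿ different sequences. The function names below are hypothetical, and equal representation of the four bases at every position is assumed, which real oligonucleotide synthesis only approximates.

```python
import random

ALPHABET = "GATC"  # the four bases an "N" position may take


def barcode_complexity(n_positions):
    """Number of distinct sequences in a fully degenerate run of N's."""
    return len(ALPHABET) ** n_positions


def random_barcode(n_positions, rng):
    """Draw one barcode, assuming each base is equally likely."""
    return "".join(rng.choice(ALPHABET) for _ in range(n_positions))


# Complexity for runs of 4, 7, 8 and 10 degenerate positions.
complexities = {n: barcode_complexity(n) for n in (4, 7, 8, 10)}

# One simulated 8-base barcode from a seeded generator.
example = random_barcode(8, random.Random(0))
```

Under these assumptions a run of 4 Ns gives 256 sequences and a run of 10 Ns gives 1,048,576; a run of at least 7 Ns (16,384 sequences) is the shortest that exceeds the 10,000-sequence figure given above for a high complexity barcode.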

A variable barcode sequence can be used to correct sequencing errors and to count how many times an initial molecule is sequenced (e.g., in cases where substantially every molecule in a sample is tagged with a different sequence, and then the sample is amplified). See, e.g., Casbon, Nuc. Acids Res., 22 e81 (2011).

Among other things, e.g., correcting sequence/PCR errors, allele calling, etc., a variable barcode sequence can be used to determine the number of initial target polynucleotide molecules that have been analyzed, i.e., to “count” the number of initial target polynucleotide molecules that have been analyzed. PCR amplification of molecules that have been tagged with a variable barcode sequence results in multiple sub-populations of products that are clonally-related in that each of the different sub-populations is amplified from a single tagged molecule. As would be apparent, even though there may be several thousand, or millions or more, molecules in any of the clonally-related sub-populations of PCR products and the number of target molecules in those clonally-related sub-populations may vary greatly, the number of molecules tagged in the first step of the method can be estimated by counting the number of variable barcode sequences associated with a target sequence that is represented in the population of PCR products. This number is useful because, in certain embodiments, the population of PCR products made using this method may be sequenced to produce a plurality of sequences. The number of different variable barcode sequences that are associated with the sequences of a target polynucleotide can be counted, and this number can be used (along with, e.g., the sequence of the fragment, the sequence of the ends of the fragment) to estimate the number of initial template nucleic acid molecules that have been sequenced.
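The counting described above can be sketched, under simplifying assumptions, as collapsing sequence reads that share both a target and a variable barcode sequence. The target names, barcode strings and the function name `count_initial_molecules` are hypothetical; sequencing errors within the barcode and the use of fragment end positions are deliberately ignored in this sketch.

```python
from collections import defaultdict


def count_initial_molecules(reads):
    """Estimate tagged molecules per target from (target, barcode) pairs.

    Reads with the same target and the same barcode are treated as PCR
    duplicates of a single initial molecule, so the estimate is the
    number of distinct barcodes seen for each target.
    """
    barcodes_by_target = defaultdict(set)
    for target, barcode in reads:
        barcodes_by_target[target].add(barcode)
    return {target: len(codes) for target, codes in barcodes_by_target.items()}


# Hypothetical reads: geneA has two distinct barcodes, geneB has one.
reads = [
    ("geneA", "ACGTACGT"),
    ("geneA", "ACGTACGT"),  # PCR duplicate of the read above
    ("geneA", "TTGCAAGC"),
    ("geneB", "GGATCCAA"),
    ("geneB", "GGATCCAA"),
    ("geneB", "GGATCCAA"),
]
estimates = count_initial_molecules(reads)
```

Here six reads collapse to an estimate of two initial molecules for the first target and one for the second, regardless of how unevenly each molecule was amplified.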

Other descriptions of terms may appear throughout the specification.

DETAILED DESCRIPTION

Provided herein are methods and compositions for labeling target nucleic acid molecules with molecular barcodes (e.g., with sample-specific barcodes and/or with molecule-specific barcodes). In some embodiments, the molecules are labeled one and only one time. In some embodiments, the individual original molecules are labeled more than one time (e.g., in a streamlined procedure).

In some embodiments, the methods and compositions described herein relate to the labeling of a library of polynucleotide molecules with a molecular barcode. In some embodiments, the library of polynucleotide molecules comprises a first (3′) adapter and a sequence derived from genomic DNA, cDNA or RNA (e.g., a cDNA library). In some embodiments, the library of polynucleotide molecules also includes a second (5′) adapter. In some embodiments, the 5′ adapter includes an unreplicable region. Modified nucleotides that cannot typically be replicated by DNA polymerases include xanthosine, 3-nitropyrrole, 5-nitroindole, or any C-glycosidic nucleotide analogs. In some embodiments, the library is a ss polynucleotide library. In some embodiments, the library of polynucleotide molecules is a complete genomic DNA library or a complete mRNA/cDNA library. In some embodiments, the library of polynucleotide molecules is a target-enriched genomic DNA library or a target-enriched mRNA/cDNA library.

In some embodiments, the genomic DNA, cDNA or mRNA/cDNA library is generated from the DNA or RNA of a human or a non-human animal. In some embodiments, the non-human animal is a mammal (e.g., a cow, pig, horse, donkey, goat, camel, cat, dog, guinea pig, rat, mouse, sheep, monkey, gorilla, chimpanzee). In some embodiments, the genomic DNA or mRNA/cDNA library is human.

In certain embodiments, the initial DNA being analyzed may be derived from a single source (e.g., a single organism, virus, tissue, cell, subject, etc.), whereas in other embodiments, the nucleic acid sample may be pooled with other samples that are from a plurality of sources, where by “plurality” is meant two or more. As such, in certain embodiments, after pooling a nucleic acid sample can contain nucleic acids from 2 or more sources, 3 or more sources, 5 or more sources, 10 or more sources, 50 or more sources, 100 or more sources, 500 or more sources, 1000 or more sources, 5000 or more sources, up to and including about 10,000 or more sources. A sample identifier sequence may be added to each of the sources prior to pooling, and the sequence can allow the sequences from different sources to be distinguished after they are analyzed.

In some embodiments, the DNA or mRNA/cDNA library is generated using DNA fragments obtained from a clinical sample, e.g., from a patient that has or is suspected of having a disease or condition such as a cancer, inflammatory disease or pregnancy. In some embodiments, the sample may be made by extracting fragmented DNA from an archived patient sample, e.g., a formalin-fixed paraffin embedded (FFPE) tissue sample. In other embodiments, the patient sample may be a sample of cell-free circulating DNA from a bodily fluid, e.g., peripheral blood. In some embodiments, the DNA in the initial sample may already be fragmented (e.g., as is the case for FFPE samples and circulating cell-free DNA (cfDNA), e.g., ctDNA). The fragments in the initial sample may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, 80 bp to 400 bp, or 100-1,000 bp), although fragments having a median size outside of this range may be used. Cell-free or circulating tumour DNA (ctDNA), i.e., tumour DNA circulating freely in the blood of a cancer patient, is highly fragmented, with a mean fragment size of about 165-250 bp (Newman, et al., Nat Med. 2014 20: 548-54). cfDNA can be obtained by centrifuging whole blood to remove all cells, and then analyzing the remaining plasma.

The polynucleotide libraries described herein can be generated using methods described herein or otherwise known in the art. For example, in some embodiments, genomic DNA, cDNA or RNA is fragmented into small ds polynucleotide fragments followed by ligation of adapters to the ends of the polynucleotide fragments. Where mRNA is the source of the library of target DNA to be analyzed, cDNA formed by reverse transcriptase is ligated to adapters. As used herein, the term “adapter” refers to a region of known sequence located either 3′ or 5′ of a target DNA sequence in a DNA library. In some embodiments, an adapter is at least 10, 15, 20, 30, 40, 50, 60, or 70 nucleotides in length. When both 3′ adapters and 5′ adapters are used, the 3′ adapters preferably have different sequences from the 5′ adapters. Exemplary library preparation methods are described in US 2014/0287468, U.S. patent application Ser. No. 13/513,726 and U.S. Pat. No. 8,288,097, each of which is hereby incorporated by reference. Target-enriched libraries are described in US 2014/0287468, which is hereby incorporated by reference.

In some embodiments, a primer extension product (also referred to as a second strand DNA) is made from the library of polynucleotide molecules.

In some embodiments, the primer extension product is made such that the library of polynucleotide molecules is labeled with a library-specific barcode sequence located 5′ of the sequence complementary to the first adapter. In some embodiments, the primer extension product is made by contacting the library of polynucleotide molecules with sample index primers, the sample index primers comprising, in 5′ to 3′ order, a sample-specific barcode sequence and a first hybridization sequence, such that the first hybridization sequence hybridizes to a region of the first adapter of the library of polynucleotide molecules, and then incubating the hybridized DNA molecules with a polymerase such that the polymerase extends the 3′ end of the sample index primers to form primer extension products comprising, in 5′ to 3′ order, a sample-specific barcode sequence, a sequence complementary to at least a portion of the first adapter, a sequence complementary to the genomic DNA, cDNA or mRNA derived sequence, and a sequence complementary to the second adapter. As used herein, the terms “hybridize” or “hybridization” refer to the hydrogen bonding of complementary or substantially complementary DNA and/or RNA sequences to form a duplex molecule. In some embodiments, a nucleic acid sequence is referred to as being “capable of hybridizing” to another nucleic acid sequence if there is at least about 65%, at least 75% or at least 90% sequence complementarity over a stretch of at least 14 nucleotides to 25 nucleotides. In some embodiments, “a sequence capable of hybridizing” or a “hybridization sequence” is complementary to a target sequence and is at least 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 40, or 60 nucleotides in length.

Primer extension products are generated by use of a primer. The term “primer” as used herein refers to an oligonucleotide that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. Usually primers are extended by a mesophilic DNA polymerase, such as Klenow, Klenow (exo-), Bsu, or DNA polymerase I, or a thermophilic DNA polymerase such as Taq, KlenTaq®, Tth, Bst, Bst (large fragment), Vent®, Deep Vent®, Q5®, and Phusion® (all commercially available from New England Biolabs, Ipswich, Mass.). Primers can be ss, ds, or partially ss and partially ds. If ds, the primer is generally first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Thus, a primer has at least a 3′ sequence complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.

In some embodiments, a primer described herein is designed such that its 3′ end hybridizes to a 3′ adapter, leaving a 5′ tail. 5′ of the region capable of hybridizing to the 3′ adapter, the primer may include a non-random index sequence (a MID). In some embodiments, the MID is 2-15 nucleotides in length, for example, 6-10 nucleotides, for example, 8 nucleotides. In some embodiments, the primer further includes a platform-specific sequencing tag 5′ of the MID and/or the adapter hybridization sequence. The MID may be positioned between the platform-specific sequencing tag and the adapter hybridization sequence. An example of a platform-specific tag is an Illumina sequence tag for MiSeq® such as P7 (Illumina, San Diego, Calif.). The MID can be used to identify each sample in a pool of samples sequenced together in one run on a multiplex sequencing platform.
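The 5′ to 3′ layout of such a tailed primer can be illustrated with a short sketch. The three sequences used below are placeholders invented for this illustration; they are not actual platform, index or adapter sequences, and the function name `build_sample_index_primer` is likewise hypothetical. The only constraint taken from the text above is the MID length of 2-15 nucleotides.

```python
def build_sample_index_primer(platform_tag, mid, adapter_hyb_seq):
    """Assemble a tailed primer, written 5' to 3'.

    Layout: platform-specific tag, then MID, then the sequence that
    hybridizes to the 3' adapter. The tag and MID form the 5' tail.
    """
    if not 2 <= len(mid) <= 15:
        raise ValueError("MID is expected to be 2-15 nucleotides long")
    return platform_tag + mid + adapter_hyb_seq


# Placeholder sequences (hypothetical, for illustration only).
PLATFORM_TAG = "AAAACCCCGGGG"
MID = "ACGTTGCA"  # 8-nucleotide non-random index
ADAPTER_HYB = "TTGGCCAATTGGCCAA"

primer = build_sample_index_primer(PLATFORM_TAG, MID, ADAPTER_HYB)

# The 5' tail is everything upstream of the adapter hybridization sequence.
tail_length = len(primer) - len(ADAPTER_HYB)
```

With these placeholders the primer is 36 nucleotides long with a 20-nucleotide 5′ tail, and the MID sits between the platform tag and the adapter hybridization sequence.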

In some embodiments, a ds molecule comprising the primer extension product and the DNA library template is formed during a primer extension process. In some embodiments, this ds molecule is denatured into ss and then contacted with a UID-template oligonucleotide. In some embodiments, the ds molecule is treated with a 5′ exonuclease that digests the 5′ end of the original DNA library molecule. In such embodiments, the 5′ end of the primer extension product is protected from exonuclease digestion (e.g., through the absence of a 5′ phosphate or the presence of a blocking nucleotide).

In some embodiments, the primer extension product is labeled by contacting primer extension products with UID-template oligonucleotides. In some embodiments, the UID-template oligonucleotides include, in 3′ to 5′ order, a non-extendable 3′ end, a second hybridization sequence, and a variable barcode sequence (e.g., a degenerate random sequence of 2-15 nucleotides, 6-10 nucleotides or 8 nucleotides), such that the second hybridization sequence hybridizes to a 3′ terminal region of the sequence complementary to the second adapter, and then incubating the hybridized primer extension products with a polymerase such that the polymerase extends the 3′ end of the primer extension products to form a further extended primer extension product comprising, in 5′ to 3′ order, the sequence complementary to at least a portion of the first adapter, the sequence complementary to the genomic DNA, cDNA or mRNA derived sequence, the sequence complementary to the second adapter, and a sequence complementary to the variable barcode sequence of the oligonucleotide. The UID-template oligonucleotide may also include a 5′ commercial platform sequence tag. The UID-template oligonucleotide can be used in a ratio to target of 1:1 to 1×10¹², or less than 1:1 or more than 1×10¹². Mesophilic polymerases useful in the further extension of the primer extension product include Klenow, Klenow (exo-), Bsu, or DNA polymerase I, or thermophilic DNA polymerases such as Taq, KlenTaq, Tth, Bst, Bst (large fragment), Vent, Deep Vent, Q5, or Phusion.

In some embodiments, a barcode-labeled product is amplified (e.g., using PCR). Examples of nucleic acid amplification processes that can be used include, but are not limited to, polymerase chain reaction (PCR), ligase chain reaction (LCR), strand displacement amplification (SDA), transcription mediated amplification (TMA), self-sustained sequence replication (3SR), Qβ replicase based amplification, nucleic acid sequence-based amplification (NASBA), repair chain reaction (RCR), boomerang DNA amplification (BDA) and/or rolling circle amplification (RCA).

In some embodiments of the methods described herein, a sequencing process is performed on the barcode-labeled product. In some embodiments, the sequencing process is a multiplex sequencing process. Nucleic acid sequencing processes include, but are not limited to, chain termination sequencing, sequencing by ligation, sequencing by synthesis, pyrosequencing, ion semiconductor sequencing, single-molecule real-time sequencing, 454 sequencing, and/or Dilute-‘N’-Go sequencing.

In some embodiments, the sequencing process is an NGS process. NGS platforms include, but are not limited to, Massively Parallel Signature Sequencing (Lynx Therapeutics, Hayward, Calif.); 454 pyrosequencing (454 Life Sciences/Roche Diagnostics, Branford, Conn.); solid-phase, reversible dye-terminator sequencing (Solexa/Illumina, San Diego, Calif.); SOLiD® technology (Applied Biosystems/Life Technologies, Grand Island, N.Y.); Ion semiconductor sequencing (Ion Torrent™, Life Technologies, Grand Island, N.Y.); and DNA nanoball sequencing (Complete Genomics, Mountain View, Calif.). Descriptions of certain NGS platforms can be found in the following: Shendure, et al., Nature Biotechnology, 26:1135-1145 (2008); Mardis, Trends in Genetics, 24:133-141 (2008); Su, et al., Expert Rev Mol Diagn, 11(3):333-43 (2011); and Zhang et al., J Genet Genomics, 38(3):95-109 (2011), each of which is hereby incorporated by reference.

In some embodiments, provided herein is a kit for performing a method described herein. The term “kit” refers to any delivery system for delivering materials or reagents for carrying out a method described herein. In some embodiments, such delivery systems can include systems that allow for the storage, transport, or delivery of reaction reagents (e.g., probes, enzymes, adapters, primers, etc. in the appropriate containers) and/or supporting materials (e.g., buffers, written instructions for performing the assay, etc.) from one location to another. For example, in some embodiments kits include one or more enclosures (e.g., boxes) containing the relevant reaction reagents and/or supporting materials. Such contents may be delivered to the intended recipient together or separately. For example, a first container may contain an enzyme for use in an assay, while a second container contains probes. A kit may be formulated for selecting and enriching target templates from a nucleic acid sample containing non-target and target sequences. The kit may include one or more primers as described herein; adapters; nucleases; ligase; polymerase(s); buffers; and nucleotides. The kit may further comprise one or more buffer solutions and standard solutions for the creation of a DNA library.

In some embodiments, a terminal transferase is used to generate degenerate sequences for molecule identification (UIDs). For example, as illustrated in FIG. 4, in some embodiments primer extension products of original DNA library molecules are further extended by terminal transferase, generating a random sequence at the 3′ end of the primer extension product, which extends beyond the 5′ end of the original DNA library molecule in a random manner. A ss adapter can then be attached to the 3′ end of the extension product or alternatively be introduced through a ds adapter having a random sequence at the 3′ end of the second strand. As illustrated in FIG. 5, in some embodiments the terminal transferase forms a bridge over an unreplicable region of a 5′ adapter of the original DNA library molecule, thereby permitting further extension of the primer extension product. Modified nucleotides that cannot typically be replicated by DNA polymerases include xanthosine, 3-nitropyrrole, 5-nitroindole, or any C-glycosidic nucleotide analogs.

The term “non-naturally occurring” refers to a composition that does not exist in nature. In the context of a nucleic acid, the term “non-naturally occurring” refers to a nucleic acid that contains: a) a sequence of nucleotides that is different to a nucleic acid in its natural state; b) one or more non-naturally occurring monomers (which may result in a non-natural backbone or sugar that is not G, A, T or C); c) one or more other modifications (e.g., an added label or other moiety) to the 5′-end, the 3′-end, and/or between the 5′- and 3′-ends of the nucleic acid; and/or d) a combination of sequences that would not occur together in nature but here have been synthesized enzymatically or by chemical synthesis.

In the context of a composition or preparation, the term “non-naturally occurring” refers to: a) a combination of components that are not combined by nature, e.g., because they are at different locations, in different cells or different cell compartments; b) a combination of components that have relative concentrations that are not found in nature; c) a combination that lacks something that is usually associated with one of the components in nature; d) a combination that is in a form that is not found in nature, e.g., dried, freeze dried, crystalline, aqueous; e) a combination that contains a component that is not found in nature (for example, a preparation may contain a buffering agent (e.g., Tris, HEPES, TAPS, MOPS, tricine or MES), a detergent, a dye, a reaction enhancer or inhibitor, an oxidizing agent, a reducing agent, a solvent or a preservative that is not found in nature); and/or f) a combination that is contained in a non-cell container such as a reaction vessel, where the term “reaction vessel” refers to a tube or well in which reagents may be in solution or immobilized, where immobilization may occur on the surface of the reaction vessel or on a bead in the reaction vessel.

All references cited herein are incorporated by reference.

EXAMPLES

While the examples provided herein describe specific temperatures, reagents, sequences, incubation times, buffers and other reaction conditions, such conditions are exemplary and are not intended to be limiting. Similarly, the order of steps is described as an example; the order of steps may be modified, and certain steps may be added or deleted as expedient.

Example 1: Introduction of a UID into a DNA Library

Degenerate sequences were added to a DNA library using the methods generally illustrated in FIGS. 1A-1F and 2A-2G.

A second-strand synthesis reaction was performed on library molecules such that the newly formed primer extension product contained on the 5′ end: the P7 sequence, the indexing sequence and the read 2 sequencing primer binding sequence; and on the 3′ end: the read 1 sequencing primer binding sequence (FIGS. 3A-3C). Specifically, a library preparation of DNA fragments ligated to hairpin adapters at both ends was captured on streptavidin beads in 43 μL of water. A primer extension was performed by forming a reaction solution containing 41 μL of this DNA library preparation with 1 μL of 10 mM dNTPs, 5 μL of 10× ThermoPol® Reaction Buffer (New England Biolabs, Ipswich, Mass.), 2 μL of NEBNext® Multiplex Primer (New England Biolabs, Ipswich, Mass.) with 6 phosphorothioate bonds at the 5′ end and 1 μL of Q5 High-Fidelity DNA Polymerase, which was incubated at 72° C. for 10 minutes.

In some cases, the complex between the primer extension product and the DNA library template was denatured by incubation at 95° C. prior to addition of a UID template oligonucleotide, as depicted in FIGS. 1A-1F. In other cases, the library template was digested from its 5′ end by using lambda exonuclease as depicted in FIGS. 2A-2G. In such instances, the primer extension product was protected at the 5′ end by phosphorothioate bonds whereas the 5′ terminal phosphate on the original library molecules was not. To digest the 5′ end of the library molecules, 1 μL of lambda exonuclease was added and incubated at 20° C. for 10 minutes. USER could be used in place of lambda exonuclease to similar effect.

A UID template was added to the 3′ end of the primer extension product created in the second-strand synthesis reaction (FIGS. 3E-3F). The 3′ end of the primer extension product (specifically the read 1 sequencing primer binding sequence) was used as a defined region to hybridize with the 3′ end of a UID template oligonucleotide, after which the 3′ end of the primer extension product was further extended to incorporate a copy of the UID template sequence. Specifically, UID-template oligonucleotides containing a region complementary to the read 1 sequencing primer binding sequence, eight random nucleotides, and the P5 sequence (Illumina) were combined with a thermostable DNA polymerase, dNTPs and the primer extension product. This reaction was performed in a 100 μL reaction mixture that included 50 μL of the primer extension reaction, 0.5 μL of Taq DNA Polymerase, 1 μL of 10 mM dNTPs, 2 μL of UID-template (0.05 μM), 5 μL of 10× ThermoPol Reaction Buffer, and 41.50 μL of water. This reaction was cycled four times to ensure that a maximal number of molecules were barcoded.
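As an illustrative check on the reaction setup described above, the listed component volumes can be tallied and the final concentration of the UID-template estimated from the dilution relationship C1·V1 = C2·V2. The following Python sketch is not part of the described method; the variable and function names are hypothetical, and the calculation considers only the UID-template added at this step.

```python
# Component volumes (microliters) as listed for the UID-templating reaction.
# The dictionary keys are descriptive labels chosen for this sketch.
components_ul = {
    "primer extension reaction": 50.0,
    "Taq DNA Polymerase": 0.5,
    "10 mM dNTPs": 1.0,
    "UID-template (0.05 uM stock)": 2.0,
    "10x ThermoPol Reaction Buffer": 5.0,
    "water": 41.5,
}


def total_volume_ul(components):
    """Sum the volumes of all listed components."""
    return sum(components.values())


def final_concentration(stock_conc, stock_volume_ul, total_ul):
    """Apply C1*V1 = C2*V2 to get the concentration after dilution."""
    return stock_conc * stock_volume_ul / total_ul


total = total_volume_ul(components_ul)
# UID-template: 2 uL of a 0.05 uM stock diluted into the full reaction.
uid_final_um = final_concentration(0.05, 2.0, total)

print(f"Total volume: {total:.1f} uL")
print(f"UID-template final concentration: {uid_final_um * 1000:.1f} nM")
```

Under these assumptions the component volumes sum to the stated 100 μL, and the UID-template is present at roughly 1 nM in the final mixture.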

The resulting barcoded primer extension product was combined with PCR primers that annealed to P5 and P7 primer recognition sequences such that only molecules with molecular IDs were amplified. The PCR reaction was run for 23 cycles using 2 μL of 100 μM PCR1 and PCR2 primers (FIG. 3G). The amplification product was purified using 0.9× Agencourt AmPure® (Beckman Coulter, Brea, Calif.) beads and eluted in a final volume of 20 μL, twice. The amplification products were then sequenced on the Illumina MiSeq platform.

When the process illustrated in FIG. 1 was performed, of the maximum 65,536 unique barcodes that can be generated with an 8-base random barcode, 64,381 were identified in the sequenced amplification product. When the process illustrated in FIG. 2 was performed, 56,551 unique barcodes were identified after sequencing.
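The figures reported above can be put in context with a short calculation: a barcode of n random bases drawn from four nucleotides has 4^n possible sequences, and the observed counts can be expressed as a fraction of that maximum. The Python sketch below is illustrative only; the function names are hypothetical and do not correspond to any analysis software used in the example.

```python
def barcode_space(n_random_bases, alphabet_size=4):
    """Number of distinct barcodes for a fully random region of the given length."""
    return alphabet_size ** n_random_bases


def recovery_fraction(observed_unique, n_random_bases):
    """Fraction of the theoretical barcode space that was observed."""
    return observed_unique / barcode_space(n_random_bases)


max_barcodes = barcode_space(8)
# Unique barcode counts reported for the two workflows in Example 1.
fig1_fraction = recovery_fraction(64381, 8)  # denaturation workflow (FIG. 1)
fig2_fraction = recovery_fraction(56551, 8)  # exonuclease workflow (FIG. 2)

print(f"Theoretical maximum for 8 random bases: {max_barcodes}")
print(f"FIG. 1 process: {fig1_fraction:.1%} of possible barcodes observed")
print(f"FIG. 2 process: {fig2_fraction:.1%} of possible barcodes observed")
```

On these numbers, the process of FIG. 1 recovered about 98.2% of the 65,536 possible barcodes and the process of FIG. 2 about 86.3%.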

Example 2: Introduction of a UID into a DNA Library Using a Terminal Transferase

Degenerate sequences are added to a DNA library using the methods generally outlined in FIGS. 4A-4F and 5A-5E.

A library preparation of DNA fragments ligated to adapters at both ends is captured on streptavidin beads in 43 μL of water. A primer extension is performed by combining 41 μL of the captured library molecules with 1 μL of 10 mM dNTPs, 5 μL of 10× ThermoPol Reaction Buffer and 2 μL of NEBNext Multiplex Primer, which is then incubated for 5 minutes at 95° C. followed by a ramp down to 30° C. at 0.1° C./second. On ice, 1 μL of Taq DNA Polymerase is added and the reaction solution is incubated at 72° C. for 5 minutes. The magnetic beads are immobilized using a magnet and the supernatant is transferred to a new tube to which 5 μL of terminal transferase is added. The terminal transferase reaction solution is incubated at 20° C. for 10 minutes, 75° C. for 10 minutes and then placed on ice. An adapter is added to the primer extension product by adding 20 μL of 5× Quick Ligation Buffer (New England Biolabs, Ipswich, Mass.), 10 μL of Quick T4 DNA Ligase (New England Biolabs, Ipswich, Mass.), 10 μL of dsDNA adapter with degenerate 3′ end (0.05 μM) and 5 μL of H₂O to the reaction solution. This solution is incubated overnight at 20° C. The resulting barcode labeled library is purified with 0.9× Agencourt AmPure beads and eluted in 23 μL. The purified labeled library is PCR amplified using 25 μL of LongAmp® Taq 2× Master Mix (New England Biolabs, Ipswich, Mass.) and 2 μL of 100 μM PCR1 and PCR2 primers. The resulting amplification product is purified using 0.9× Agencourt AmPure beads and eluted in a final volume of 20 μL.

Using the method described in FIGS. 4A-4F and 5A-5E, terminal transferase can add between 1 and 30 random nucleotides, theoretically generating up to 1×10¹⁸ unique barcode sequences. In the method described in FIGS. 4A-4F, the library is generated containing 6-10 random nucleotides, generating a theoretical maximum of between 4,096 and 1,048,576 unique sequences. The libraries are sequenced on the Illumina platform and analyzed by standard methods.
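The theoretical maxima quoted above follow from the number of random positions: each added nucleotide multiplies the number of possible sequences by four. The Python sketch below reproduces the stated bounds for 6 to 10 random nucleotides and for 30 random nucleotides. It is an illustration only, with hypothetical function and variable names, and it assumes all four nucleotides are equally available at every position.

```python
def max_unique_sequences(length):
    """Theoretical number of distinct sequences for a random region of this length."""
    return 4 ** length


# Range used for the library described in FIGS. 4A-4F (6 to 10 random nucleotides).
defined_range = {n: max_unique_sequences(n) for n in range(6, 11)}

# Upper end of the 1-30 nucleotide range attributed to terminal transferase.
upper_bound_30 = max_unique_sequences(30)

for n, count in defined_range.items():
    print(f"{n} random nucleotides: {count:,} possible sequences")
print(f"30 random nucleotides: {upper_bound_30:.2e} possible sequences")
```

The values for 6 and 10 random nucleotides match the 4,096 and 1,048,576 figures given above, and the value for 30 random nucleotides is on the order of 10¹⁸.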

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned herein are hereby incorporated by reference in their entirety as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference. In case of conflict, the present application, including any definitions herein, will control.

EQUIVALENTS

Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

What is claimed is:
 1. A method, comprising: (a) hybridizing a library of polynucleotide molecules comprising, in 3′ to 5′ order, a first adapter, a sequence derived from genomic DNA, cDNA or RNA, and a second adapter with a sample index primer, the sample index primer comprising, in 5′ to 3′ order, a sample-specific barcode sequence and a first hybridization sequence, wherein the first hybridization sequence hybridizes to a region of the first adapter; (b) incubating the hybridized polynucleotide molecules formed in step (a) with a polymerase such that the polymerase extends the 3′ end of the sample index primer to form primer extension products comprising, in 5′ to 3′ order, a sample-specific barcode sequence, a sequence complementary to at least a portion of the first adapter, a sequence complementary to the genomic DNA, cDNA or RNA derived sequence, and a sequence complementary to the second adapter; (c) hybridizing the primer extension products formed in step (b) with synthetic oligonucleotides comprising, in 3′ to 5′ order, a non-extendable 3′ end, a second hybridization sequence, and a variable barcode sequence, wherein the second hybridization sequence hybridizes to a 3′ terminal region of the sequence complementary to the second adapter; (d) incubating the hybridized primer extension products of step (c) with a polymerase such that the polymerase extends the 3′ end of the primer extension products to form a further extended primer extension product; and (e) optionally amplifying the further extended primer extension products formed in (d).
 2. The method according to claim 1, further comprising the step of denaturing the primer extension products from the polynucleotide molecules between steps (b) and (c).
 3. The method according to claim 1, further comprising removing a 5′ terminal sequence from the polynucleotide molecules of the library after step (b) and prior to step (c), leaving a 3′ overhang.
 4. The method of claim 3, wherein the 5′ terminus of the sample index primer is protected from exonuclease digestion and the 5′ terminal sequence of the polynucleotide molecules of the library is removed by an exonuclease.
 5. The method of claim 3, wherein the second adapter comprises one or more deoxyuridines, and the 5′ terminal sequence of the polynucleotide molecules of the library is removed using uracil DNA glycosylase (UDG) and DNA endonuclease VIII.
 6. The method according to claim 1, wherein step (d) further comprises amplifying by PCR the further extended primer extension product.
 7. The method according to claim 6, further comprising the step of performing a sequencing process on the amplification products.
 8. The method, comprising performing the method of claim 1 to label a library of polynucleotide molecules with unique variable sequences; amplifying the library of polynucleotide molecules; and sequencing the unique variable sequence.
 9.-35. (canceled)